Xigt: extensible interlinear glossed text for natural language processing
Abstract
This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the use case of representing a large, noisy, heterogeneous set of IGT.
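To make the idea of a tiered IGT data model with an XML serialization concrete, here is a minimal sketch in Python. It is not the Xigt schema: the element and attribute names (`igt`, `tier`, `item`, `alignment`) and the helper functions `build_igt` and `aligned_pairs` are assumptions chosen for illustration, and the German sentence is a toy example.

```python
import xml.etree.ElementTree as ET


def build_igt(igt_id, words, glosses, translation):
    """Build one IGT instance as tiers of items (hypothetical layout).

    Gloss items refer to word items by id instead of nesting inside them,
    so further tiers could be added without touching existing ones.
    """
    if len(words) != len(glosses):
        raise ValueError("this simplified sketch expects one gloss per word")

    igt = ET.Element("igt", id=igt_id)

    # Tier 1: the words of the source-language line.
    word_tier = ET.SubElement(igt, "tier", type="words", id="w")
    for i, word in enumerate(words, start=1):
        ET.SubElement(word_tier, "item", id=f"w{i}").text = word

    # Tier 2: glosses, each pointing at the word it annotates.
    gloss_tier = ET.SubElement(igt, "tier", type="glosses", id="g", alignment="w")
    for i, gloss in enumerate(glosses, start=1):
        ET.SubElement(gloss_tier, "item", id=f"g{i}", alignment=f"w{i}").text = gloss

    # Tier 3: the free translation of the whole line.
    trans_tier = ET.SubElement(igt, "tier", type="translations", id="t")
    ET.SubElement(trans_tier, "item", id="t1").text = translation
    return igt


def aligned_pairs(igt):
    """Return (word, gloss) pairs by following each gloss's alignment id."""
    words = {
        item.get("id"): item.text
        for item in igt.findall("tier[@type='words']/item")
    }
    return [
        (words[item.get("alignment")], item.text)
        for item in igt.findall("tier[@type='glosses']/item")
    ]


if __name__ == "__main__":
    igt = build_igt(
        "i1",
        ["Der", "Hund", "schläft"],
        ["the", "dog", "sleeps"],
        "The dog sleeps.",
    )
    print(ET.tostring(igt, encoding="unicode"))
    print(aligned_pairs(igt))
```

Linking tiers by id reference rather than by nesting is one way to keep such a format extensible. Whether Xigt does exactly this should be checked against the paper itself.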
Related resources
A Model for Interoperability: XML Documents as an RDF Database
We propose a model for a Resource Description Format (RDF) database for interlinear glossed text (IGT) created from documents encoded in the Extensible Markup Language (XML) using markup metaschemas. A metaschema, constructed using the Semantic Interpretation Language (SIL) (Simons 2004) maps XML-encoded documents to a common semantically rich RDF database. The RDF database in turn can be searc...
Extracting Interlinear Glossed Text from LaTeX Documents
We present texigt, a command-line tool for the extraction of structured linguistic data from LaTeX source documents, and a language resource that has been generated using this tool: a corpus of interlinear glossed text (IGT) extracted from open access books published by Language Science Press. Extracted examples are represented in a simple XML format that is easy to process and can be used to va...
Enriching a massively multilingual database of interlinear glossed text
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or...
Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties
We propose to bring together two kinds of linguistic resources—interlinear glossed text (IGT) and a language-independent precision grammar resource—to automatically create precision grammars in the context of language documentation. This paper takes the first steps in that direction by extracting major-constituent word order and case system properties from IGT for a diverse sample of languages.
Inducing grammar from IGT
We suggest a strategy for incremental construction of deep parsing grammars from Interlinear Glossed Text (IGT). IGT is a format of representation where standard linguistics and NLP in principle meet, since they are a data-type which is often available for digitally ‘less resourced languages’ (‘LRL’). The IGT database is TypeCraft (Beermann and Mihaylov 2009, www.typecraft.org), and the grammar...
Journal:
- Language Resources and Evaluation
Volume 49
Publication date: 2015